In this section, we apply dimensionality reduction techniques to gain insight into public transit data for major US cities. The dataset comes from the American Public Transit Association Ridership Report, which contains details about public transit ridership in 2022.¹
While this dataset does contain some information on ridership volume, unsupervised learning is generally done without a target variable or known relationships within the data. The features used here are city population, city area (square miles), average cost per trip (dollars), average fare per trip (dollars), and average miles per trip, where the observation unit for each record is an individual city. In practice, all of these features could factor into understanding the health of a public transit system, since each describes the city itself, the conditions for riders, or the cost to the city. The objective of dimensionality reduction, then, is to discover how these features relate to and interact with one another.
To perform dimensionality reduction, we use two common methods: Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE). Both will be applied to this dataset, using the following Python libraries:
numpy for obtaining eigenvalues and eigenvectors
sklearn for implementing PCA and t-SNE
matplotlib and seaborn for visualizations
Implementation
Dimensionality Reduction with PCA
Code
import pandas as pd

cities = pd.read_csv('../data/cleaned_data/apta_cities_cleaned.csv')
cities = cities.drop(columns=['Unnamed: 0'])
cities.head()
|   | City | Population | Area | Cost_per_trip | Fare_per_trip | Miles_per_trip | Total_trips | Trips_per_capita |
|---|------|------------|------|---------------|---------------|----------------|-------------|------------------|
| 0 | Seattle--Tacoma, WA | 3544011 | 982.52 | 13.906032 | 1.570667 | 5.786344 | 130093841 | 36.708080 |
| 1 | Spokane, WA | 447279 | 171.67 | 13.433827 | 0.988308 | 4.772569 | 6995911 | 15.641045 |
| 2 | Yakima, WA | 133145 | 55.77 | 19.720093 | 1.112531 | 5.179168 | 513484 | 3.856577 |
| 3 | Eugene, OR | 270179 | 73.49 | 10.851494 | 2.753356 | 3.684118 | 5296214 | 19.602612 |
| 4 | Portland, OR--WA | 2104238 | 519.30 | 10.804361 | 1.025659 | 4.011388 | 56312874 | 26.761647 |
A crucial first step for PCA is obtaining the eigenvalues and eigenvectors of the feature covariance matrix. This process is shown below.
Code
import numpy as np
from numpy import linalg as LA

X = cities.drop(columns=['City']).to_numpy()
print('NUMERIC MEAN:\n', np.mean(X, axis=0))
print("X SHAPE", X.shape)
print("NUMERIC COV:")
print(np.cov(X.T))

# Eigendecomposition of the covariance matrix
w, v1 = LA.eig(np.cov(X.T))
print("\nCOV EIGENVALUES:", w)
print("COV EIGENVECTORS (across rows):")
print(v1.T)
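The cumulative-explained-variance (scree) plot referenced below can be generated along these lines. This is a sketch: it assumes the feature matrix has already been standardized, and uses a synthetic `X_demo` array in place of the cities feature matrix.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized cities feature matrix (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 7))

# Fit PCA with all components to inspect the explained variance spectrum
pca_full = PCA().fit(X_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Plot cumulative explained variance against number of components
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(0.95, linestyle='--', color='gray')  # 95% threshold
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()
```

The component count is then chosen as the smallest number whose cumulative ratio crosses the 95% line.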
From the cumulative explained variance plot, more than 95% of the variance is captured by 4 components, so it is reasonable to select 4 principal components. A good way to check the efficacy of this choice is to plot the covariance as a seaborn heatmap before and after PCA. The covariance matrix of the feature dataset clearly shows a significant amount of covariance between several of the variables.
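A sketch of that pre-PCA heatmap is below, with a random `X_demo` array standing in for the feature matrix (an assumption; note that for standardized features the covariance and correlation matrices coincide).

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Stand-in for the standardized feature matrix; replace with the real data
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 7))
features = pd.DataFrame(X_demo, columns=[f'f{i}' for i in range(7)])

# Heatmap of pairwise correlations before PCA
corr = features.corr()
sns.heatmap(corr, cmap='coolwarm')
plt.title('Feature correlations before PCA')
plt.show()
```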
from sklearn.decomposition import PCA

pca = PCA(n_components=4)
pca.fit(X)
data_pca = pca.transform(X)
data_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3', 'PC4'])
data_pca.head()
|   | PC1 | PC2 | PC3 | PC4 |
|---|-----|-----|-----|-----|
| 0 | 3.033796 | 0.350153 | 0.122935 | 0.019467 |
| 1 | 0.070137 | -0.474524 | 0.321328 | -0.191796 |
| 2 | -0.723213 | -0.152967 | -0.292844 | -0.111154 |
| 3 | 0.051012 | -0.449029 | 0.788093 | -0.746024 |
| 4 | 1.584307 | -0.407123 | 0.325299 | -0.194906 |
After applying PCA, below is another heatmap showing the correlation between principal components, which highlights the usefulness of this process. There is essentially no correlation between principal components, indicating that the 4-component selection was effective in summarizing the data.
Code
import seaborn as sns

sns.heatmap(data_pca.corr())
<Axes: >
Finally, below is a plot to visualize the data after selecting principal components.
Code
import matplotlib.pyplot as plt

%matplotlib widget
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches axes in recent Matplotlib
ax.scatter(data_pca['PC2'], data_pca['PC3'], data_pca['PC4'], c=data_pca['PC1'])
ax.set_title("3D Plot of Principal Components")
ax.set_xlabel('PC2')
ax.set_ylabel('PC3')
ax.set_zlabel('PC4')
plt.show()
Dimensionality Reduction with t-SNE
For implementing t-SNE, we once again use sklearn. The TSNE() class limits the embedding to at most three components (with its default Barnes-Hut method), so here it is mainly used for parameter tuning, analyzing how different perplexity values affect the visualizations. The results of a couple of these runs are below:
Code
from sklearn.manifold import TSNE

X_embedded = TSNE(n_components=3, learning_rate='auto',
                  init='random', perplexity=1).fit_transform(X)

# EXPLORE RESULTS
print("RESULTS")
print("shape : ", X_embedded.shape)
print("First few points : \n", X_embedded[0:4, :])

# PLOT
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], alpha=0.5)
plt.show()
Ultimately, for this application, PCA proved more useful for understanding relationships within the feature matrix of our data. In general, PCA is ideal for preserving global variance in the data, while t-SNE more effectively preserves local neighborhood relationships. The crucial difference between the two is that PCA is a linear technique while t-SNE is non-linear. For a dataset like this one, where ordering of data points is not a factor, the features were separable from one another, and the initial dimensionality is quite low, PCA is likely to be more effective.
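The linearity of PCA can be verified directly: its transform is just centering followed by a matrix product with the component vectors, whereas t-SNE has no such closed form. A minimal check on synthetic data (the `X_demo` array is an assumption, standing in for any feature matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(40, 5))

pca = PCA(n_components=3).fit(X_demo)
scores = pca.transform(X_demo)

# PCA is linear: transforming is centering then projecting onto the components
manual = (X_demo - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual))  # → True
```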
Footnotes
“Raw monthly ridership (no adjustments or estimates),” Raw Monthly Ridership (No Adjustments or Estimates) | FTA, https://www.transit.dot.gov/ntd/data-product/monthly-module-raw-data-release (accessed Nov. 14, 2023).↩︎
“Reduce data dimensionality using PCA - Python,” GeeksforGeeks, https://www.geeksforgeeks.org/reduce-data-dimentionality-using-pca-python/ (accessed Nov. 14, 2023).↩︎